The data set in this exploratory data analysis contains 1599 red wine samples, each described by a quality score and 11 physicochemical attributes (12 original variables in total). We added one more attribute, ‘total.acidity’, to represent the sum of the three types of acids in the data set.
The dimensions of the data set are listed below:
## [1] 1599 13
The name and type of each variable are shown below:
## 'data.frame': 1599 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ total.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
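The derived column can be checked with a short sketch. In the structure printed above, the first ten values of ‘total.acidity’ coincide with those of ‘fixed.acidity’, so it may be worth confirming that the sum really includes all three acid columns. The toy data frame below reuses the first three printed rows; the name `wine` is an assumption, not necessarily the object used in the original analysis.

```r
# Toy rows: the first three samples from the structure printed above.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8),
  volatile.acidity = c(0.70, 0.88, 0.76),
  citric.acid      = c(0.00, 0.00, 0.04)
)

# Row-wise sum of the three acid columns.
wine$total.acidity <- rowSums(
  wine[, c("fixed.acidity", "volatile.acidity", "citric.acid")]
)

# Compare the derived column with fixed acidity alone.
print(wine[, c("fixed.acidity", "total.acidity")])
```

If the sum is computed this way, every ‘total.acidity’ value should be at least as large as the matching ‘fixed.acidity’ value.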
To get a better understanding of the data structure, it is necessary to have a look at the descriptive statistics of each variable.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## sulphates density pH alcohol
## Min. :0.3300 Min. :0.9901 Min. :2.740 Min. : 8.40
## 1st Qu.:0.5500 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.: 9.50
## Median :0.6200 Median :0.9968 Median :3.310 Median :10.20
## Mean :0.6581 Mean :0.9967 Mean :3.311 Mean :10.42
## 3rd Qu.:0.7300 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:11.10
## Max. :2.0000 Max. :1.0037 Max. :4.010 Max. :14.90
## quality total.acidity
## Min. :3.000 Min. : 4.60
## 1st Qu.:5.000 1st Qu.: 7.10
## Median :6.000 Median : 7.90
## Mean :5.636 Mean : 8.32
## 3rd Qu.:6.000 3rd Qu.: 9.20
## Max. :8.000 Max. :15.90
The summary gives a rough picture of the distribution of each variable. We noticed:
Judging by the maximum values, there are a few extreme outliers in the attributes from ‘fixed.acidity’ to ‘total.sulfur.dioxide’: in each case the maximum lies far above the 3rd quartile.
The values of most attributes, except for ‘total.sulfur.dioxide’, change little between the 1st quartile and the 3rd quartile, which means the majority of red wine samples have a similar composition.
Next, we will use histograms to get a better sense of the distribution of each attribute of red wine.
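One way to make the outlier observation concrete is the common 1.5 × IQR rule, sketched below. The rule and the helper name `count.outliers` are assumptions for illustration, not part of the original analysis. The input is the ten residual sugar values printed in the structure output above.

```r
# Count values outside the 1.5 * IQR fences (a common rule of thumb).
count.outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence.low  <- q[1] - 1.5 * IQR(x)
  fence.high <- q[2] + 1.5 * IQR(x)
  sum(x < fence.low | x > fence.high)
}

# First ten residual.sugar values from the printed structure.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)
n.outliers <- count.outliers(sugar)
print(n.outliers)
```

Applied column by column to the full data set, the same helper would give a count of flagged samples per attribute.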
Brief introduction: Fixed acidity covers most of the acids in red wine that do not evaporate readily. The majority of fixed acid is tartaric acid, which plays an important role in maintaining the chemical stability and color of the wine and ultimately influences the taste of the finished wine.
Distribution: It has a slightly right-skewed (long-tailed) distribution in our data set. Most samples have a fixed acidity between approximately \(6.5\) and \(9.0\) \(g/dm^3\). There are a few outliers with fixed acidity below \(5\) \(g/dm^3\) or above \(13\) \(g/dm^3\).
Brief introduction: From Wikipedia we know that acetic acid in wine, often referred to as volatile acidity (VA) or vinegar taint, can be contributed by many wine spoilage yeasts and bacteria. The volatile acidity represents the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste.
Distribution: The volatile acidity distribution appears to be weakly bimodal. One mode is around \(0.4\) \(g/dm^3\) and another is around \(0.6\) \(g/dm^3\). We can also notice a high peak around \(0.5\) \(g/dm^3\). This distribution also has a few outliers above \(1.0\) \(g/dm^3\).
Further interests: It would be interesting to further explore the quality of the red wine sample groups at the two most common volatile acidity levels, i.e. around \(0.4\) \(g/dm^3\) and around \(0.6\) \(g/dm^3\), along with the amount of antimicrobial agent (sulfur dioxide).
Brief introduction: Citric acid is an inexpensive supplement which can be used to boost the wine’s total acidity. It can add aggressive citric flavors to the wine. In the European Union, use of citric acid for acidification is prohibited.
Distribution: The distribution of citric acid looks like an exponential distribution. Most of the red wine samples contain a small amount of citric acid. However, there are four tall vertical bars at citric acid concentrations of \(0\) \(g/dm^3\), \(0.02\) \(g/dm^3\), \(0.24\) \(g/dm^3\), and \(0.48\) \(g/dm^3\). The peak at \(0\) \(g/dm^3\) is consistent with the European Union’s usage prohibition. Most samples have a citric acid concentration between \(0\) \(g/dm^3\) and \(0.7\) \(g/dm^3\).
Further interests: The reasons for the other three high peaks are not yet known. Since citric acid is added to boost the wine’s total acidity, we may later explore the underlying reason by looking at the relationship between total acidity and citric acid.
The total acidity is dominated by fixed acids, and its histogram looks almost the same as the histogram of fixed acidity.
Brief introduction: The variable pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic). Total acidity tells us the concentration of acids present in wine whereas the pH level tells us how intense those acids taste.
Distribution: The distribution of pH is approximately normal with a mean of \(3.311\). Most red wine samples in our data set have pH values from \(3.0\) to \(3.6\).
Further interests: pH is an index of how acidic a wine tastes, and according to the website Understanding Acidity in Wine, the perceived acidity can be reduced by sweetness. We would like to dig further into the relationship between red wine quality and pH in groups with different sweetness levels.
Brief introduction: ‘residual.sugar’ represents the amount of sugar remaining after fermentation stops. From How Basic Wine Characteristics Help You Find Favorites, we know that the level of sweetness is one important aspect of red wine quality.
Distribution: The original distribution in the left plot is highly right-skewed, so we applied a log transformation to residual sugar to obtain the right plot. After the log transformation, the distribution of residual.sugar looks approximately normal. The center is around \(2.1\) \(g/dm^3\), and most samples fall between \(1.2\) \(g/dm^3\) and \(3.0\) \(g/dm^3\).
Further interests: Since there are a few samples with extremely high residual sugar, we would like to divide the whole data set into 3 groups by residual sugar level, from low to high, and then explore the red wine quality in each group.
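A three-level grouping of this kind could be sketched with `cut()` and tertile breakpoints. The tertile choice and the names `sugar` and `sugar.level` are assumptions; the original analysis may have used different cut points. The values are simulated purely for illustration.

```r
# Simulated residual sugar values (illustration only, not the wine data).
set.seed(42)
sugar <- rlnorm(300, meanlog = log(2.2), sdlog = 0.35)

# Tertile breakpoints: minimum, 1/3 and 2/3 quantiles, maximum.
breaks <- quantile(sugar, probs = c(0, 1/3, 2/3, 1))

# Assign each sample to a low / mid / high group.
sugar.level <- cut(sugar, breaks = breaks,
                   labels = c("low", "mid", "high"),
                   include.lowest = TRUE)

print(table(sugar.level))
```

Quantile-based breakpoints give groups of roughly equal size, which keeps the extreme samples from ending up in a nearly empty group.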
Brief introduction: ‘chlorides’ represents the total amount of salt in the red wine. From Chloride concentration in red wines: influence of terroir and grape type, we know that wine contains salts of mineral acids, along with some organic acids, and they may play a key role in a potential salty taste of a wine, with chlorides being a major contributor to saltiness. Moderate to large concentrations of chlorides and sodium might give the wine a salty flavor which may turn away potential consumers.
Distribution: The chlorides distribution has also been log-transformed because it is highly right-skewed. Most samples fall between \(0.05\) \(g/dm^3\) and \(0.12\) \(g/dm^3\), with a center around \(0.08\) \(g/dm^3\).
Further interests: Similar to the analysis of residual sugar, we would like to divide the whole data set into three saltiness levels, from low to high chlorides concentration, to further explore the relationship between red wine quality and chlorides.
Brief introduction: The total sulfur dioxide consists of the free and bound forms of sulfur dioxide. The free form of sulfur dioxide exists in equilibrium between molecular sulfur dioxide (as a dissolved gas) and bisulfite ion. Meanwhile, sulphates contribute to sulfur dioxide gas levels. Sulfur dioxide prevents microbial growth and the oxidation of wine.
Distribution: After a log transformation, all three sulfur-related variables have approximately normal distributions. The free sulfur dioxide distribution has two peaks, around \(6\) \(mg/dm^3\) and around \(15\) \(mg/dm^3\), with a main range from \(3\) \(mg/dm^3\) to \(50\) \(mg/dm^3\). The total sulfur dioxide distribution is centered around \(45\) \(mg/dm^3\) with a main range from \(8\) \(mg/dm^3\) to \(140\) \(mg/dm^3\). The sulphates distribution is centered around \(0.58\) \(g/dm^3\) with a main range from \(0.4\) \(g/dm^3\) to \(1.2\) \(g/dm^3\).
Further interests: It would be interesting to think about how these three sulfur-related variables work together to influence red wine quality.
Brief introduction: The alcohol content of a red wine is one of the most important properties for its quality.
Distribution: The alcohol percentage has a right-skewed distribution, mainly from \(9\%\) to \(14\%\). There is no need to take the logarithm of alcohol percentage due to its small range.
Brief introduction: The quality level, from poor to good, is represented by integers from low to high.
Distribution: In the histogram, we observed that the majority of the red wine samples have quality levels 5 or 6, the middle of the scale, which is reasonable for the red wine market.
This dataset has 1599 observations and 12 original variables. The variables can be divided into 4 groups:
variables linked to acids: ‘fixed.acidity’, ‘volatile.acidity’, ‘citric.acid’, ‘pH’
variables linked to other main features: ‘residual.sugar’, ‘alcohol’, ‘density’, ‘chlorides’
variables linked to additives: ‘free.sulfur.dioxide’, ‘total.sulfur.dioxide’, ‘sulphates’
main variable: ‘quality’
Summary of Univariate Analysis: From the preliminary analysis above, we observed that most of the red wine samples have an alcohol percentage from \(9\%\) to \(14\%\), pH from \(3.0\) to \(3.6\), residual sugar from \(1\) \(g/dm^3\) to \(3\) \(g/dm^3\) and total acidity from \(5\) \(g/dm^3\) to \(14\) \(g/dm^3\).
In this analysis, the main feature of interest is ‘quality’.
Based on the background knowledge about red wine that we gathered, we may intuitively speculate that the variables describing the main taste features of the wine, such as ‘alcohol’, ‘residual.sugar’, ‘chlorides’, ‘pH’, ‘fixed.acidity’ and ‘volatile.acidity’, will contribute most to red wine quality. The additive attributes ‘citric.acid’ and the three sulfur-related variables may also influence red wine quality to some degree.
Yes, ‘total.acidity’ represents the sum of the three different acid concentrations in the red wine data set.
of the data? If so, why did you do this?
Many variables, such as ‘residual.sugar’, ‘alcohol’, ‘chlorides’, ‘free.sulfur.dioxide’, ‘total.sulfur.dioxide’ and ‘sulphates’, have long-tailed distributions. We transformed the x-axis scale to a ‘log10’ scale to make the distributions more symmetric.
Here we will investigate the relationship between each pair of variables using the correlation matrix below.
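The effect of a log10 transformation on a long-tailed variable can be illustrated by comparing sample skewness before and after. This is a sketch on simulated values, not the wine data, and the helper `skewness` is defined here for illustration rather than taken from a package.

```r
# Moment-based sample skewness (0 for a perfectly symmetric sample).
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Simulated long-tailed values (illustration only).
set.seed(1)
sugar <- rlnorm(500, meanlog = log(2.2), sdlog = 0.4)

skew.raw <- skewness(sugar)          # clearly positive: long right tail
skew.log <- skewness(log10(sugar))   # much closer to zero

print(c(raw = skew.raw, log10 = skew.log))
```

A skewness near zero after the transformation is consistent with the more symmetric, bell-like histograms described for the log10 scale.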
From the matrix above, we can draw the following conclusions:
The correlation coefficient between ‘quality’ and ‘alcohol’ equals 0.495, which is relatively high.
‘pH’ is strongly correlated with ‘fixed.acidity’ and ‘citric.acid’, with coefficients of -0.633 and -0.551. But the coefficient between ‘pH’ and ‘volatile.acidity’ is 0.236, which is positive. This positive correlation is counterintuitive, because acids lower the pH value.
‘sulphates’ has a stronger correlation with ‘volatile.acidity’, -0.253, than ‘free.sulfur.dioxide’ and ‘total.sulfur.dioxide’ do, which suggests that sulphates are more strongly associated with lower volatile acidity.
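A correlation matrix of this kind can be computed with `cor()`. The sketch below uses only the first ten rows printed in the structure output, so its coefficients will differ from the full-data values quoted above; the object names are assumptions.

```r
# First ten rows of four columns, copied from the printed structure.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations between all columns.
corr.matrix <- cor(wine)
print(round(corr.matrix, 3))
```

Even on this small subset, ‘pH’ and ‘fixed.acidity’ come out negatively correlated, in line with the sign reported for the full data set.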
Now, we start to explore relationships between ‘quality’ and other variables.
Volatile acids and quality:
The correlation coefficient between ‘quality’ and ‘volatile.acidity’ is \(-0.35\). This suggests that higher volatile acidity is associated with a noticeably lower quality level. Let’s take a closer look at the distribution of volatile acidity at different quality levels.
We can observe that the red wine samples with higher quality have a smaller range of volatile acidity. For low quality red wine (level 3 to 4), the mean and median of volatile acidity are above 0.6. For medium quality red wine (level 5 to 6), the median and mean of volatile acidity lie in \((0.5,0.6)\). Meanwhile, high quality red wine (level 7 to 8) has a median and mean around 0.5.
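Per-level means and medians like those described above can be tabulated with `tapply()`. The seven rows below are made-up values chosen only to keep the sketch small; they are not taken from the wine data.

```r
# Hypothetical toy samples (not the real data).
wine <- data.frame(
  quality          = c(3, 3, 5, 5, 5, 7, 7),
  volatile.acidity = c(0.90, 0.70, 0.60, 0.50, 0.55, 0.40, 0.30)
)

# Mean and median of volatile acidity within each quality level.
by.quality <- data.frame(
  quality = sort(unique(wine$quality)),
  mean    = as.vector(tapply(wine$volatile.acidity, wine$quality, mean)),
  median  = as.vector(tapply(wine$volatile.acidity, wine$quality, median))
)

print(by.quality)
```

Running the same summary on the full data set would give the per-level figures that the boxplots show graphically.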
Recall that the distribution of ‘volatile.acidity’ was bimodal in the previous analysis. This may be because important variables that influence volatile acidity have not been taken into account yet. Since sulphates have a relatively high correlation coefficient with volatile acidity, we would like to plot the distribution of volatile acidity at two different sulphates levels.
From the plot above, we can see that the sulphates level plays a critical role in separating the different volatile acidity populations.
Balance of sourness and sweetness:
Since sourness and sweetness are two important tastes in red wine quality, we want to explore the influence of these two tastes.
From these two boxplots, we noticed that total acidity, representing sourness, and residual sugar, representing sweetness, have similar distribution patterns. From level 3 to level 7, we observed an increasing trend in sourness and sweetness. However, we also observed a drop in both sourness and sweetness from level 7 to level 8. Since sweetness can be masked by sourness in red wine, it is reasonable to calculate the ratio of total acidity to residual sugar. The result is shown as follows.
This plot makes more sense than the previous two. We observed that when the red wine quality level is higher than 4, the median and mean of the ratio are almost the same, around \(3.75\). This is further illustrated in the right figure, where red wine with a quality level higher than 4 (‘mid’ and ‘high’ groups) has linear regression lines with similar slopes. This might be a reasonable ratio between sourness and sweetness.
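The ratio and the per-group regression slopes could be computed roughly as follows. The six rows are hypothetical and were constructed so that the ratio is exactly \(3.75\); they only demonstrate the calculation and say nothing about the real data. The column name `acid.sugar.ratio` is an assumption.

```r
# Hypothetical rows built to have a ratio of exactly 3.75.
wine <- data.frame(
  quality        = c(5, 5, 5, 7, 7, 7),
  residual.sugar = c(2.0, 2.4, 3.0, 1.8, 2.2, 2.6),
  total.acidity  = c(7.5, 9.0, 11.25, 6.75, 8.25, 9.75)
)

# Ratio of sourness (total acidity) to sweetness (residual sugar).
wine$acid.sugar.ratio <- wine$total.acidity / wine$residual.sugar
ratio.median <- tapply(wine$acid.sugar.ratio, wine$quality, median)

# Slope of total.acidity on residual.sugar within each quality level.
slopes <- sapply(split(wine, wine$quality), function(d) {
  coef(lm(total.acidity ~ residual.sugar, data = d))[["residual.sugar"]]
})

print(ratio.median)
print(slopes)
```

Similar slopes across quality groups are what the right-hand figure is described as showing for the ‘mid’ and ‘high’ groups.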
Chlorides and quality:
As noted above, moderate to large concentrations of chlorides and sodium might give the wine a salty flavor which may turn away potential consumers. In addition, the correlation between ‘chlorides’ and ‘quality’ is \(-0.161\). Hence, we want to explore the relationship between chlorides and quality.
By checking the distribution of chlorides at different quality levels, we noticed that red wine at the lowest quality level (level 3) tends to have a larger range of chlorides and the highest mean chlorides concentration. At quality levels 4 to 7, chlorides have a similar distribution for the majority of the red wine samples. At the highest level (level 8), red wine has the smallest range of chlorides and the lowest mean chlorides concentration.
Sulfur dioxide:
Since the correlation coefficient between ‘free.sulfur.dioxide’ and ‘quality’ is only \(-0.0723\), we only plotted the distributions of ‘total.sulfur.dioxide’ and ‘sulphates’ at different quality levels. From the left figure, we noticed that total sulfur dioxide increases from level 3 to level 6 and then decreases from level 6 to level 8. An intuitive speculation would be that medium quality red wine uses a larger amount of sulfur dioxide as an antimicrobial, while high quality red wine might rely on other, better antimicrobial methods rather than adding sulfur dioxide. From the right figure, we see a monotonically increasing trend in sulphates as the red wine quality level goes up. This suggests that sulphates contribute to good red wine quality.
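The two trend shapes described here (rise then fall, and monotonic increase) can be checked numerically on per-level medians. The median values below are hypothetical placeholders, and the helper names are assumptions for illustration.

```r
# TRUE when every step to the next quality level goes up.
is.increasing <- function(x) all(diff(x) > 0)

# Name of the quality level where the series peaks.
peak.level <- function(x) names(x)[which.max(x)]

# Hypothetical per-level medians (placeholders, not the real values).
sulphates.median <- c("3" = 0.55, "4" = 0.56, "5" = 0.58,
                      "6" = 0.64, "7" = 0.74, "8" = 0.76)
total.so2.median <- c("3" = 20, "4" = 30, "5" = 40,
                      "6" = 45, "7" = 30, "8" = 22)

print(is.increasing(sulphates.median))
print(is.increasing(total.so2.median))
print(peak.level(total.so2.median))
```

With real per-level medians from `tapply()`, the same two helpers would confirm or contradict the visual impression from the boxplots.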
Alcohol:
Alcohol has the highest correlation coefficient with red wine quality, \(0.495\). Hence it is the most relevant single attribute for red wine quality.
As you can see, alcohol percentage does not contribute noticeably to quality when the red wine has a low quality level (level 3 to 4). The reason might be that low quality red wine did not go through a natural fermentation. Instead, the winemaker may have added artificial alcohol to the red wine. For quality levels 5 to 8, red wine quality improves roughly linearly as the alcohol percentage goes up.
Density:
Density is another main attribute of red wine quality, although it does not vary widely between red wines. The variable ‘density’ still has a noticeable correlation coefficient of \(-0.177\), which shows that red wine with high density may have somewhat worse quality.
We noticed that red wine density decreases as quality increases in most cases, except for an increasing trend from level 4 to level 5. This inconsistency divides the red wine samples into two groups: a bad quality group (level 3 to 4) and a good quality group (level 5 to 8). This division is also suitable for the analysis of ‘alcohol’, ‘total.sulfur.dioxide’ and ‘total.acidity’. With this division, the attributes we analyzed have a monotonically increasing or decreasing trend as quality goes up within each quality group.
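The two-group division can be sketched as follows: label levels 3 to 4 as ‘bad’ and 5 to 8 as ‘good’, then check whether density falls monotonically within each group. The six density values are hypothetical and only mimic the pattern described above (a rise from level 4 to level 5).

```r
# Hypothetical per-level density values (not the real medians).
wine <- data.frame(
  quality = c(3, 4, 5, 6, 7, 8),
  density = c(0.9975, 0.9965, 0.9971, 0.9966, 0.9961, 0.9952)
)

# Bad quality: levels 3 to 4; good quality: levels 5 to 8.
wine$group <- ifelse(wine$quality <= 4, "bad", "good")

# Within each group, does density decrease as quality increases?
decreasing.within <- sapply(split(wine, wine$group), function(d) {
  all(diff(d$density[order(d$quality)]) < 0)
})

# Across all levels the trend is broken by the step from 4 to 5.
decreasing.overall <- all(diff(wine$density[order(wine$quality)]) < 0)

print(decreasing.within)
print(decreasing.overall)
```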
The main feature of interest in this analysis is ‘quality’. In this section, we found that ‘alcohol’ has a strong linear correlation with ‘quality’, with coefficient \(0.495\). However, red wine with a low quality level does not necessarily have a low alcohol percentage, which might be a result of adding artificial alcohol.
‘volatile.acidity’ and ‘chlorides’ create unpleasant tastes in red wine: the less ‘volatile.acidity’ and ‘chlorides’ a red wine contains, the better its quality tends to be. ‘density’ also influences red wine quality in a negative way. Red wine with lower density would have a clearer texture and better quality.
We also found that it makes more sense to consider the ratio between ‘total.acidity’ and ‘residual.sugar’ when analyzing red wine quality than to consider them separately. Red wine with middle to high quality levels has a similar ratio, around \(3.75\).
For antimicrobials, ‘sulphates’ contribute to good red wine quality monotonically, while ‘total.sulfur.dioxide’ needs to be balanced to avoid decreasing red wine quality by adding too much sulfur dioxide.
We observed that ‘volatile.acidity’ has a bimodal distribution. After dividing the whole population into two subgroups according to the ‘sulphates’ level, each subgroup has a distribution with only one mode. This suggests that the ‘sulphates’ level is strongly related to volatile acidity.
The strongest relationship is between ‘quality’ and ‘alcohol’, which seems reasonable for red wine quality.
In this section, we will further explore the relationship between red wine quality and related factors. Here we will see how the relationship between red wine quality and alcohol changes under the influence of other factors.
Remark: for better intuitive understanding:
As you can see from above, when red wine has a higher alcohol percentage, its quality is more likely to be better. For red wine with a quality level less than 6, the bulk of the data accumulates around an alcohol percentage of 10. As red wine quality gets better, the bulk of the data for each level moves further to the right, which represents a higher alcohol percentage. We are interested in whether this pattern holds in different red wine sample groups.
From the two plots above, we again confirmed our previous conclusion that red wine with a lower density level has a larger proportion of high quality samples than red wine with higher density levels. Furthermore, from the second plot we can see that in the three different density level groups, the basic pattern of the relationship between quality and alcohol remains.
From the two plots above, it is not hard to find that volatile acids do influence red wine quality in a negative way. In the second plot we see that the basic pattern remains in each group.
These two plots lead to a similar conclusion: the basic pattern of the relationship between quality and alcohol remains, and chlorides give red wine an unpleasant smell and hence decrease red wine quality.
Instead of just analyzing the basic pattern of the relationship between red wine quality and alcohol, we would like to check more complex patterns that combine one more related factor in different sample subgroups.
We can draw conclusions from the above plot:
In each subgroup, the density pattern does not change, which means that red wine with a lower density level tends to have better quality and a higher alcohol percentage.
Different chlorides levels influence red wine density significantly, while different volatile acidity levels do not.
However, volatile acidity does not hold the same pattern in every subgroup. In the three subgroups in the last row, the data points seem to follow a random trend rather than the trend that red wine with lower volatile acidity has better quality. One reasonable speculation is that when red wine has high density, it is hard to taste the unpleasantness created by volatile acids. It might be covered by other tastes.
In this plot, the pattern of the chlorides level is only significant in the first column. In the mid and high red wine density groups, red wine with a low chlorides level does not show a high level of red wine quality. The reason might be the same as above: as red wine density increases, the unpleasant smell of chlorides might be covered by other smells or tastes.
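Subgroup comparisons like these can be approximated numerically by cutting two variables into levels and summarizing quality in each cell. The sketch below uses simulated data and tertile cut points; the simulated relationship, the cut points, and the helper name `tertile` are all assumptions, so the resulting numbers carry no meaning about real wine.

```r
# Simulated data (illustration only, not the wine data).
set.seed(7)
n <- 600
wine <- data.frame(
  density   = rnorm(n, mean = 0.9967, sd = 0.0019),
  chlorides = rlnorm(n, meanlog = log(0.08), sdlog = 0.3),
  alcohol   = runif(n, min = 9, max = 14)
)
# Quality loosely driven by alcohol, clamped to the 3..8 scale.
wine$quality <- round(pmin(8, pmax(3, 0.4 * wine$alcohol + 1 +
                                      rnorm(n, sd = 0.7))))

# Cut a numeric variable into low / mid / high tertiles.
tertile <- function(x) {
  cut(x, breaks = quantile(x, probs = c(0, 1/3, 2/3, 1)),
      labels = c("low", "mid", "high"), include.lowest = TRUE)
}
wine$density.level   <- tertile(wine$density)
wine$chlorides.level <- tertile(wine$chlorides)

# Mean quality in each of the 3 x 3 subgroups.
mean.quality <- with(wine, tapply(quality,
                                  list(density.level, chlorides.level),
                                  mean))
print(round(mean.quality, 2))
```

On the real data, comparing rows and columns of such a table would show whether the chlorides pattern really weakens in the mid and high density groups.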
In this section, we explored the relationship between ‘quality’ and ‘alcohol’ along with three related variables, ‘density’, ‘chlorides’ and ‘volatile.acidity’. First of all, we already know from the last section that ‘quality’ is positively correlated with ‘alcohol’. Then, from the observations in this section, we noticed that in the mid or high red wine density subgroups, the unpleasant smell or taste created by chlorides and volatile acids might be covered by other tastes, so that the patterns of the chlorides and volatile acidity levels change in those subgroups.
The red wine density level has a stable pattern in different subgroups. Hence ‘density’ is a critical attribute for describing red wine quality. However, the chlorides and volatile acidity levels do not hold a stable pattern in different subgroups. Their influence on red wine quality is more obvious in the low density red wine subgroups.
Plot 1 suggests a reason why the volatile acidity distribution is bimodal. From this plot we observed that when we divided the red wine sample population into two subgroups by sulphates level, the volatile acidity distribution in each subgroup has only one mode. Thus the sulphates level is clearly related to volatile acidity.
In the second plot, we explored the balance between sourness and sweetness in red wine. In the left figure, we use a box plot to show the distribution of the ratio between total acidity and residual sugar at different quality levels. In the right figure, we use a scatter plot and linear regression to explore the relationship between total acidity and residual sugar. From both figures, we can conclude that red wine with a quality level higher than 4 has a relatively stable ratio between total acidity and residual sugar.
The last plot shows that red wine density influences red wine quality in the same way in different subgroups. From this we can infer that red wine density is a critical feature for judging red wine quality.
This exploratory data analysis is about a dataset containing information on different ingredients of red wine samples. This dataset has 1599 observations and 12 variables, not counting the index. In the initial phase, I explored each variable through univariate analysis. In this preliminary exploration, I noticed that some of the variables have long-tailed distributions. Hence I transformed the x-axis scale to a ‘log10’ scale and verified that the log distribution is bell-like and symmetric. I noticed that quite a few red wine samples have a ‘citric.acid’ value equal to 0. This made me start wondering whether citric acid is a good additive in red wine, and what about the other additives. After examining the correlation matrix, I decided to focus on exploring, in pairs, the variables with correlation coefficients above 0.2 with ‘quality’. Next I compared those pairs along with other variables together.
The main findings can be summarized as follows. In the bivariate analysis, I made two surprising observations. The first is that for red wine of medium and high quality, the ratio between sourness and sweetness is stable; namely, total acidity and residual sugar have an almost fixed ratio in these red wine samples. The second is that volatile acids create unpleasant tastes in red wine, and they appear to be reduced by adding sulphates. Furthermore, I also found that the influence of different variables on red wine quality differs between low quality and higher quality red wine samples, and that ‘alcohol’ has the strongest linear correlation with ‘quality’. Next, in the multivariate analysis, I noticed that ‘density’ is a critical feature of red wine quality, because the patterns of red wine density in different subgroups share the same trend: red wine with lower density is more likely to be at a better quality level. Meanwhile, ‘volatile.acidity’ and ‘chlorides’ do not influence red wine quality in a significant way if the red wine has medium to high density. The unpleasant smell and tastes created by these two ingredients can be covered by other tastes. In short, I found two important features for judging red wine quality: ‘alcohol’ and ‘density’. There are still other related features that influence red wine quality: ‘chlorides’, ‘sulphates’, ‘volatile.acidity’ and the ratio between ‘total.acidity’ and ‘residual.sugar’.
Further interests based on this analysis could be as follows:
How do additives (such as ‘chlorides’, ‘sulphates’, ‘total.sulfur.dioxide’ and ‘citric.acid’) work together to enhance red wine quality? What is the best level of each additive?
What is the relationship between ‘total.sulfur.dioxide’, ‘free.sulfur.dioxide’ and ‘sulphates’? Does any one of these three have a dominating effect?
Inferential statistical modeling methods can be applied based on this analysis. Since red wine quality is an ordered categorical variable, logistic regression (for example, an ordinal variant) is a reasonable choice.
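As a starting point, a binary simplification of this idea can be fitted with `glm()`: model whether a wine is ‘high quality’ as a function of alcohol. The sketch below runs on simulated data with a made-up true relationship, so the fitted numbers are not findings. The binary outcome `high` is an assumption about how quality might be coded; an ordered model such as `MASS::polr` could be used instead to keep all quality levels.

```r
# Simulated data with a hypothetical positive alcohol effect.
set.seed(123)
n <- 400
alcohol <- runif(n, min = 9, max = 14)
p.high  <- plogis(-12 + 1.0 * alcohol)      # made-up true probabilities
high    <- rbinom(n, size = 1, prob = p.high)

# Logistic regression: log-odds of high quality as a linear function of alcohol.
fit <- glm(high ~ alcohol, family = binomial)

print(coef(fit))

# Predicted probability of high quality at 10% and 13% alcohol.
pred <- predict(fit, newdata = data.frame(alcohol = c(10, 13)),
                type = "response")
print(pred)
```

A positive alcohol coefficient in such a model would correspond to the positive correlation between ‘alcohol’ and ‘quality’ reported in the bivariate analysis.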